Hello, World!

The following command prints “Hello, world!”:

print("Hello, world!")
## [1] "Hello, world!"

Installing Libraries

The library() call below loads the tidyverse package, which must already be installed (for example, with install.packages("tidyverse")).

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Load Data

load("college.Rdata")
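The code in the following sections refers to a data frame named sc. That name is not assigned anywhere in this document because load() restores objects under the names they were saved with. A minimal sketch of that behavior, using a temporary file and made-up values rather than the real college.Rdata:

```r
# Hypothetical stand-in for the real data; the values are made up.
sc <- data.frame(instnm = c("College A", "College B"),
                 adm_rate = c(0.25, 0.60))

# Save it to a temporary .Rdata file, then remove it from the workspace.
path <- tempfile(fileext = ".Rdata")
save(sc, file = path)
rm(sc)

# load() restores the saved objects and returns their names.
loaded <- load(path)
print(loaded)
## [1] "sc"
```

Printing the return value of load() is a quick way to confirm which objects a .Rdata file contains.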

List of variables in the college.Rdata dataset:

Variable Name: Definition

instnm: Institution Name

stabbr: State Abbreviation

year: Year

control: Control of institution, 1 = public, 2 = private non-profit, 3 = private for-profit

preddeg: Predominant degree, 1 = certificate, 2 = associates, 3 = bachelor’s, 4 = graduate

adm_rate: Proportion of Applicants Admitted

sat_avg: Midpoint of entrance exam scores, on SAT scale, math and verbal only

costt4_a: Average cost of attendance (tuition and room and board less all grant aid)

debt_mdn: Median debt of graduates

md_earn_wne_p6: Earnings of graduates who are not enrolled in higher education, six years after graduation

ugds: number of undergraduates

Homework Questions

Question 1. Calculate the average earnings for individuals at the most selective and least selective colleges in the dataset.

The following code filters for schools whose admission rate is greater than 30%, the threshold I use to define the least selective schools. For those schools, I compute the mean of the earnings variable (md_earn_wne_p6), which measures earnings of graduates six years after graduation.

Then I filter for schools whose admission rate is less than 10%, the threshold I use to define the most selective schools, and compute the same mean for that group.

## What's the average earnings for individuals at the least selective schools?
sc%>%filter(adm_rate>.3)%>%summarize(md_earn_wne_p6=mean(md_earn_wne_p6,na.rm=TRUE))
## # A tibble: 1 x 1
##   md_earn_wne_p6
##            <dbl>
## 1         34747.
## What's the average earnings for individuals at the most selective schools?
sc%>%filter(adm_rate<.1)%>%summarize(md_earn_wne_p6=mean(md_earn_wne_p6,na.rm=TRUE))
## # A tibble: 1 x 1
##   md_earn_wne_p6
##            <dbl>
## 1          53500

Answer: The average earnings for individuals at the most and least selective colleges are about 53,500 and 34,747, respectively.
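The two filtered summaries can also be produced in a single pipeline by labeling each school with its selectivity group and then grouping. The following is a sketch only: the toy data frame and the names toy, selectivity, and mean_earnings are hypothetical, and the thresholds are the same 10% and 30% cutoffs used in this answer.

```r
library(dplyr)

# Hypothetical toy data; the real analysis uses the `sc` data frame.
toy <- tibble(
  adm_rate       = c(0.05, 0.08, 0.20, 0.45, 0.70),
  md_earn_wne_p6 = c(60000, 50000, 40000, 35000, NA)
)

earnings_by_group <- toy %>%
  # Label each school; schools matching neither condition get NA.
  mutate(selectivity = case_when(
    adm_rate < 0.1 ~ "most selective",
    adm_rate > 0.3 ~ "least selective"
  )) %>%
  filter(!is.na(selectivity)) %>%   # drop schools in neither group
  group_by(selectivity) %>%
  summarize(mean_earnings = mean(md_earn_wne_p6, na.rm = TRUE))

earnings_by_group
```

On the toy data this gives 55000 for the most selective group and 35000 for the least selective group. Keeping both groups in one table makes the comparison easier to read than two separate one-row results.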

Question 2. Determine whether colleges with very high SAT scores tend to be larger or smaller than colleges with low SAT scores

College size can be interpreted using the variable, ugds, the number of undergraduates. Dell’Arte International School of Physical Theatre has the smallest undergraduate program of only 27 students. Florida International University has the largest number of undergraduates, 30,920 students. To determine if colleges with high SAT scores tend to be larger or smaller than colleges with low SAT scores, I decided to create a scatterplot graphic to examine the relationship between school size and average SAT score.

The following code uses ggplot to create a graphics object, which I name gg. Within it, I declare the dataset (‘sc’), the x variable (‘sat_avg’), and the y variable (‘ugds’). I also map the institution name (‘instnm’) to the text aesthetic. The last two lines of code make the scatterplot interactive, so that hovering the mouse cursor over a point shows which university it corresponds to.

gg<-ggplot(data=sc, aes(x=sat_avg,y=ugds,text=instnm))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+xlab("Average SAT")+ylab("Number of Undergraduates")
gg<-gg+ggtitle("Average SAT and Number of Undergraduates")
gg
## Warning: Removed 27 rows containing missing values (geom_point).

# ggplotly() comes from the plotly package, which library(tidyverse) does not attach
gg_p<-plotly::ggplotly(gg)
gg_p

The scatterplot gives me an overall view of how average SAT score relates to college size in our dataset. Based on it, I filter the dataset for the universities with the highest and the lowest SAT scores. I define colleges with the highest SAT scores as those whose average SAT score is greater than 1400, and colleges with the lowest SAT scores as those whose average SAT score is less than 900.

sc%>%filter(sat_avg>1400)%>%select(instnm,sat_avg,ugds)%>%arrange(-ugds)
## # A tibble: 20 x 3
##    instnm                                      sat_avg  ugds
##    <chr>                                         <dbl> <int>
##  1 University of Pennsylvania                     1436 10842
##  2 Northwestern University                        1427  8905
##  3 University of Notre Dame                       1433  8367
##  4 Columbia University in the City of New York    1445  7743
##  5 Harvard University                             1468  7181
##  6 Emory University                               1403  6868
##  7 Vanderbilt University                          1430  6764
##  8 Stanford University                            1436  6564
##  9 Washington University in St Louis              1462  6436
## 10 Duke University                                1440  6416
## 11 Brown University                               1420  6013
## 12 Yale University                                1475  5258
## 13 Tufts University                               1450  5146
## 14 University of Chicago                          1425  5101
## 15 Princeton University                           1482  5029
## 16 Massachusetts Institute of Technology          1472  4218
## 17 Dartmouth College                              1432  4090
## 18 Rice University                                1425  3279
## 19 Williams College                               1424  2033
## 20 California Institute of Technology             1514   951

Of the 20 colleges whose average SAT score is higher than 1400, most have a fairly high number of undergraduate students: 11 of them have more than 6,000 undergraduates. California Institute of Technology is unlike the rest in that it enrolls only 951 undergraduate students.

sc%>%filter(sat_avg<900)%>%select(instnm,sat_avg,ugds)%>%arrange(-ugds)
## # A tibble: 11 x 3
##    instnm                                     sat_avg  ugds
##    <chr>                                        <dbl> <int>
##  1 California State University-San Bernardino     894 14373
##  2 New Jersey City University                     835  6328
##  3 Grambling State University                     851  4534
##  4 Albany State University                        876  3988
##  5 University of Arkansas at Pine Bluff           784  3624
##  6 Delaware State University                      868  3222
##  7 Central State University                       759  2398
##  8 Mississippi Valley State University            825  2350
##  9 Kentucky State University                      823  2326
## 10 Lincoln University                             812  2020
## 11 Claflin University                             895  1735

Of the 11 colleges whose average SAT score is lower than 900, nine have fewer than 5,000 undergraduate students. Overall, when comparing schools with very high and very low SAT scores, it appears that colleges with very high SAT scores (greater than 1400) tend to be larger.
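One way to put a number on this comparison is to take the median enrollment in each group. The sketch below uses the undergraduate counts copied from the two tables above; the vector names are hypothetical.

```r
# Undergraduate counts copied from the two tables above.
ugds_high_sat <- c(10842, 8905, 8367, 7743, 7181, 6868, 6764, 6564, 6436,
                   6416, 6013, 5258, 5146, 5101, 5029, 4218, 4090, 3279,
                   2033, 951)
ugds_low_sat <- c(14373, 6328, 4534, 3988, 3624, 3222, 2398, 2350, 2326,
                  2020, 1735)

# Medians are less sensitive than means to a single very large school.
median(ugds_high_sat)
## [1] 6214.5
median(ugds_low_sat)
## [1] 3222
```

The median is used here rather than the mean because California State University-San Bernardino is much larger than the other low-SAT schools and would pull that group's mean upward.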

Based on a previous email exchange with Professor Bolton, I would like to answer this question with a slightly different approach. The institution’s admission rate can serve as a proxy for college size. The rationale is that an institution with a higher admission rate will have more students enrolled, while a more selective institution with a lower admission rate will have fewer. The following code lists the same two groups of institutions (average SAT score above 1400 and below 900) together with their admission rates, sorted from the highest to the lowest admission rate.

sc%>%filter(sat_avg>1400)%>%select(instnm,sat_avg,adm_rate)%>%arrange(-adm_rate)
## # A tibble: 20 x 3
##    instnm                                      sat_avg adm_rate
##    <chr>                                         <dbl>    <dbl>
##  1 University of Notre Dame                       1433   0.286 
##  2 University of Chicago                          1425   0.273 
##  3 Emory University                               1403   0.266 
##  4 Tufts University                               1450   0.266 
##  5 Northwestern University                        1427   0.262 
##  6 Duke University                                1440   0.224 
##  7 Rice University                                1425   0.223 
##  8 Washington University in St Louis              1462   0.222 
##  9 Williams College                               1424   0.204 
## 10 Vanderbilt University                          1430   0.202 
## 11 University of Pennsylvania                     1436   0.177 
## 12 California Institute of Technology             1514   0.153 
## 13 Dartmouth College                              1432   0.135 
## 14 Brown University                               1420   0.112 
## 15 Columbia University in the City of New York    1445   0.110 
## 16 Massachusetts Institute of Technology          1472   0.107 
## 17 Princeton University                           1482   0.101 
## 18 Yale University                                1475   0.0856
## 19 Stanford University                            1436   0.0797
## 20 Harvard University                             1468   0.0719
sc%>%filter(sat_avg<900)%>%select(instnm,sat_avg,adm_rate)%>%arrange(-adm_rate)
## # A tibble: 11 x 3
##    instnm                                     sat_avg adm_rate
##    <chr>                                        <dbl>    <dbl>
##  1 Central State University                       759    0.389
##  2 Delaware State University                      868    0.371
##  3 Claflin University                             895    0.360
##  4 New Jersey City University                     835    0.348
##  5 Grambling State University                     851    0.333
##  6 University of Arkansas at Pine Bluff           784    0.327
##  7 Lincoln University                             812    0.310
##  8 Albany State University                        876    0.299
##  9 Mississippi Valley State University            825    0.295
## 10 Kentucky State University                      823    0.242
## 11 California State University-San Bernardino     894    0.241

It is not easy to conclude whether institutions with high SAT scores tend to be larger or smaller when admission rate is used as a proxy for school size. Of the 20 universities whose average SAT score is higher than 1400, half have an admission rate greater than 0.20 and half have an admission rate lower than 0.20. The 11 institutions whose average SAT score is lower than 900 have admission rates ranging from 0.24 to 0.39. Institutions with low average SAT scores tend to have higher admission rates, so under this proxy it is possible that they are also larger.
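Whether admission rate is a reasonable proxy for size can be checked directly, since the dataset contains both variables. The sketch below computes a rank correlation for the 11 low-SAT institutions, using the values copied from the tables above. The names low_sat and rho are hypothetical, and with only 11 schools the result is illustrative rather than conclusive.

```r
# Values copied from the two low-SAT tables above (11 institutions).
low_sat <- data.frame(
  instnm = c("Central State", "Delaware State", "Claflin",
             "New Jersey City", "Grambling State", "Arkansas at Pine Bluff",
             "Lincoln", "Albany State", "Mississippi Valley State",
             "Kentucky State", "Cal State San Bernardino"),
  adm_rate = c(0.389, 0.371, 0.360, 0.348, 0.333, 0.327,
               0.310, 0.299, 0.295, 0.242, 0.241),
  ugds = c(2398, 3222, 1735, 6328, 4534, 3624,
           2020, 3988, 2350, 2326, 14373)
)

# Spearman's rank correlation is robust to the one very large school.
rho <- cor(low_sat$adm_rate, low_sat$ugds, method = "spearman")
rho
```

A value close to +1 would support the proxy. A value near zero or below would suggest that admission rate says little about enrollment in this group.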

Question 3. Plot the relationship between cost and debt. What do you see? Does this surprise you?

To examine the relationship between the cost of attendance and median student debt, I use ggplot to create a graphics object from the dataset (‘sc’), with the x variable (‘costt4_a’) and the y variable (‘debt_mdn’). I also map the institution name (‘instnm’) to the text aesthetic. The last lines of code make the scatterplot interactive, so that hovering the mouse cursor over a point shows which university it corresponds to. The interactive graphic is stored as “gg1.”

gg<-ggplot(data=sc, aes(x=costt4_a,y=debt_mdn,text=instnm))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+xlab("Average Cost of Attendance")+ylab("Median Debt of Graduates")
gg
## Warning: Removed 1 rows containing missing values (geom_point).

# ggplotly() comes from the plotly package, which library(tidyverse) does not attach
gg_p<-plotly::ggplotly(gg)
gg_p
gg1<-gg_p

On first impression, one might presume that if the average cost of attendance is higher, then the median debt of graduates will also be higher. The scatterplot bears this out to some extent. Several universities whose average cost of attendance is around 50k have the highest median debt of graduates, around 16k to 17k. Most schools whose graduates have a median debt of less than 15,000 dollars have an average cost of attendance of 40,000 dollars or less. Furthermore, schools with the lowest average cost of attendance, around 10,000 dollars, have the smallest median debt. At the extremes, graduates of the schools with the highest and lowest average cost of attendance have the highest and lowest median debt, respectively. This aligns with my initial assumptions.

There are a couple of elements in the scatterplot that surprise me. First, I did not expect such a wide range of median debt among the universities whose average cost of attendance is around 50,000 dollars. For example, the average cost of attending Harvard University is 50,250 dollars, but the median debt of its graduates is only 6,000 dollars. Yet students who graduate from the California Institute of the Arts, whose average cost of attendance is 48,784 dollars, finish with a median debt of 18,187.50 dollars. It would be interesting to explore to what extent these graduates received grants and scholarships to support their educational expenses, but this information is not provided in the current dataset.

Second, I did not anticipate that schools whose average cost of attendance is between 13,000 and 35,000 dollars would have graduates with such similar median debt. It baffles me that graduates from Albany State University have a median debt similar to that of graduates from Rocky Mountain College of Art and Design, even though the average cost of attending Rocky Mountain College of Art and Design is almost 21,000 dollars higher than that of Albany State University.
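The visual impression of a positive but noisy relationship can be supplemented with a correlation coefficient and a fitted straight line. This is a sketch under stated assumptions: the toy data frame is hypothetical and does not reproduce values from the dataset, and the names toy, r, and gg_fit are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data; the real analysis uses the `sc` data frame.
toy <- data.frame(
  costt4_a = c(12000, 18000, 25000, 34000, 48000, 52000),
  debt_mdn = c(6000, 9500, 11000, 12500, 18000, 6500)
)

# Pearson correlation; use = "complete.obs" drops rows with missing values.
r <- cor(toy$costt4_a, toy$debt_mdn, use = "complete.obs")
r

# Same scatterplot as above, with a least-squares line layered on top.
gg_fit <- ggplot(toy, aes(x = costt4_a, y = debt_mdn)) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  xlab("Average Cost of Attendance") +
  ylab("Median Debt of Graduates")
gg_fit
```

A wide scatter of points around the fitted line, especially at the high-cost end, would correspond to the spread in median debt described above.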

Question 4. Provide separate plots for cost and debt by control of the institution.

The dataset includes three types of institutions: public, private non-profit, and private for-profit. This characteristic is recorded in the variable ‘control’.

control: control of institution, 1=public, 2= private non-profit, 3=private for-profit

The following code will tell me how many public, private non-profit, and private for-profit schools there are in this dataset.

table(sc$control)
## 
##  1  2  3 
## 35 83  7

There are 35 public, 83 private non-profit, and 7 private for-profit schools in this dataset.
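The numeric codes in this table are easier to read once they are converted to labeled factor levels. A small sketch, which rebuilds a control vector from the counts reported above instead of loading the dataset; the names control_labeled and counts are hypothetical.

```r
# Rebuild a control vector with the counts reported above (35, 83, 7).
control <- rep(1:3, times = c(35, 83, 7))

# Attach the labels from the variable list to the numeric codes.
control_labeled <- factor(
  control,
  levels = 1:3,
  labels = c("Public", "Private non-profit", "Private for-profit")
)

counts <- table(control_labeled)
counts
```

With the real data, the same factor() call could be applied to sc$control before tabulating or plotting.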

I am interested in examining the relationship between average cost of attendance and median debt of graduates for the public institutions only. To do this, I use both the ‘filter’ and ‘select’ commands. I first filter for public institutions, then select the institution name, average cost of attendance, and median debt of graduates. The data is presented in descending order of cost. This subset of data is stored as ‘Public.’

sc%>%filter(control=="1")%>%
  select(instnm,costt4_a,debt_mdn)%>%arrange(-costt4_a)
## # A tibble: 35 x 3
##    instnm                                                 costt4_a debt_mdn
##    <chr>                                                     <int>    <dbl>
##  1 University of California-Berkeley                         26275   12312.
##  2 University of California-Los Angeles                      24725   11523 
##  3 University of California-San Diego                        23433   13394.
##  4 College of William and Mary                               20806   13335 
##  5 University of Virginia-Main Campus                        20488   12000 
##  6 Lincoln University                                        19126   15687 
##  7 California Polytechnic State University-San Luis Obis…    18978   11958 
##  8 SUNY at Purchase College                                  18116   14000 
##  9 SUNY at Binghamton                                        17956   12625 
## 10 State University of New York at New Paltz                 17779   12500 
## # … with 25 more rows
Public <- sc%>%filter(control=="1")%>%select(instnm,debt_mdn,costt4_a)

There are 35 public institutions in the dataset. This information can be visually shown in a scatterplot. The following code uses the command ggplot to create a graphics object. Layering this with “geom_point” tells the program to present a scatterplot. The image is stored as “gg_pub” for later use.

gg<-ggplot(data=Public,aes(x=costt4_a,y=debt_mdn,text=instnm))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+xlab("Average Cost of Attendance")+ylab("Median Debt")
gg<-gg+ggtitle("Debt and Cost in Public Universities")
gg

gg_pub <-gg

The scatterplot above demonstrates a positive relationship between average cost of attendance and median debt in the 35 public universities.

I am now interested in examining the relationship between cost and debt of graduates for the private non-profit institutions only. The following code filters for private non-profit institutions and saves the result as a separate dataset named ‘privatenp.’ The ggplot command then uses only the data from this group of 83 schools to create a scatterplot graphics object, which is stored as “gg_privatenp.”

sc%>%filter(control=="2")%>%
  select(instnm,costt4_a,debt_mdn)%>%arrange(-costt4_a)
## # A tibble: 83 x 3
##    instnm                            costt4_a debt_mdn
##    <chr>                                <int>    <dbl>
##  1 Georgetown University                53425    14500
##  2 George Washington University         52707    15021
##  3 Washington University in St Louis    52464    14499
##  4 Middlebury College                   52460    11250
##  5 University of Chicago                52450    13126
##  6 Vanderbilt University                52303    12625
##  7 Carnegie Mellon University           52150    17125
##  8 Northwestern University              52080    12500
##  9 Boston College                       52007    17125
## 10 Wesleyan University                  51935    16690
## # … with 73 more rows
privatenp <- sc%>%filter(control=="2")%>%select(instnm,debt_mdn,costt4_a)
gg<-ggplot(data=privatenp,aes(x=costt4_a,y=debt_mdn,text=instnm))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+xlab("Average Cost of Attendance")+ylab("Median Debt")
gg<-gg+ggtitle("Debt and Cost in Private Non-Profit Universities")
gg
## Warning: Removed 1 rows containing missing values (geom_point).

gg_privatenp <-gg

The scatterplot also shows a positive relationship between average cost of attendance and median debt. It’s important to note that many of the private non-profit schools have an average cost of attendance of around 50k, and among these tightly grouped schools there is a wide range of median debt.

The following code is used to generate the last plot, showing the relationship between median debt and average cost in the seven private for-profit schools. This visual is saved as gg_privateprof for later use.

sc%>%filter(control=="3")%>%
  select(instnm,costt4_a,debt_mdn)%>%arrange(-costt4_a)
## # A tibble: 7 x 3
##   instnm                                                  costt4_a debt_mdn
##   <chr>                                                      <int>    <dbl>
## 1 South University-The Art Institute of Dallas               40851    9500 
## 2 Argosy University-The Art Institute of California-San …    35858    9616.
## 3 Schiller International University                          35408    6500 
## 4 Rocky Mountain College of Art and Design                   34589   11562.
## 5 University of Advancing Technology                         32054   11625 
## 6 DigiPen Institute of Technology                            23969   16125 
## 7 The National Hispanic University                           19135    5500
privateprof <- sc%>%filter(control=="3")%>%select(instnm,debt_mdn,costt4_a)
gg<-ggplot(data=privateprof,aes(x=costt4_a,y=debt_mdn,text=instnm))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+xlab("Average Cost of Attendance")+ylab("Median Debt")
gg<-gg+ggtitle("Debt and Cost in Private For-Profit Universities")
gg

gg_privateprof <-gg
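As an alternative to building three separate datasets and plots, ggplot2 can split a single plot into one panel per institution type with facet_wrap(). The sketch below assumes a hypothetical toy data frame in place of sc; the names toy, control_label, and gg_facet are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data; the real analysis uses the `sc` data frame.
toy <- data.frame(
  control  = c(1, 1, 2, 2, 3, 3),
  costt4_a = c(18000, 26000, 45000, 52000, 24000, 40000),
  debt_mdn = c(12500, 12000, 11000, 17000, 16000, 9500)
)

# Convert the numeric codes to labels so each panel gets a readable title.
toy$control_label <- factor(
  toy$control,
  levels = 1:3,
  labels = c("Public", "Private Non-Profit", "Private For-Profit")
)

# One panel per level of control_label, sharing the same axes.
gg_facet <- ggplot(toy, aes(x = costt4_a, y = debt_mdn)) +
  geom_point(alpha = .5) +
  facet_wrap(~control_label) +
  xlab("Average Cost of Attendance") +
  ylab("Median Debt") +
  ggtitle("Debt and Cost by Control of Institution")
gg_facet
```

Because the panels share axes by default, differences in cost and debt between institution types are directly comparable, which is harder to see across three independently scaled plots.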